perf(slirp): handle_tcp_frame followups — extract SYN, FxHash, cold paths #80

Merged
dpsoft merged 4 commits into main from slirp-perf-handle-tcp-frame-followups on May 6, 2026

Conversation


dpsoft commented May 6, 2026

Summary

Three follow-ups identified by profiling on PR #79's tcp_bulk_throughput_1mb flame graph. Each targets a specific bottleneck in handle_tcp_frame and the surrounding NAT path.

Stacked on #79. Land that first, then this rebases onto post-#79 main as a small standalone PR.

What changed

1a — Extract handle_tcp_syn_outbound from handle_tcp_frame

handle_tcp_frame was 595 lines with the outbound-SYN handler taking ~265 of them. Profiler attribution was muddy because the SYN setup was folded into the hot data-path's cost. Extract the SYN handler as a #[cold] method and reduce handle_tcp_frame to ~330 lines. Pure refactor, no behavior change.

#[inline(never)] was added to handle_tcp_frame itself so the now-smaller function isn't inlined into handle_ipv4_frame; post-extraction the optimiser had started doing exactly that, masking attribution again.
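
As a rough sketch of the resulting shape (the types, fields, and handler bodies below are illustrative stand-ins, not the crate's real code):

```rust
// Illustrative sketch only — TcpFrame, SlirpStack, and the handler bodies
// are stand-ins, not the crate's real types.
struct TcpFrame { syn: bool, ack: bool }
struct SlirpStack;

impl SlirpStack {
    /// Kept as its own frame so profilers don't fold the data-path cost
    /// into handle_ipv4_frame once the function shrinks post-extraction.
    #[inline(never)]
    fn handle_tcp_frame(&mut self, frame: &TcpFrame) {
        // Rare (once per connection): hand the SYN off to the cold handler.
        if frame.syn && !frame.ack {
            return self.handle_tcp_syn_outbound(frame);
        }
        // Hot path: Established + payload + ACK handling stays here.
        self.handle_tcp_data(frame);
    }

    /// One call per connection setup vs ~10k data frames/s on bulk transfer,
    /// so keep it out of the hot function's inline window.
    #[cold]
    fn handle_tcp_syn_outbound(&mut self, _frame: &TcpFrame) {
        // flow creation, host socket setup, SYN-ACK reply…
    }

    fn handle_tcp_data(&mut self, _frame: &TcpFrame) {}
}
```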

1b — Switch flow_table + token_to_key to FxHash

std::HashMap defaults to SipHash for DoS resistance against attacker-influenced keys. The SLIRP flow table is keyed by guest-side ports the guest itself chooses on its own networking — same trust boundary as the guest kernel. SipHash buys nothing.

rustc_hash::FxHashMap: same API (Entry, Default), drop-in replacement. The dns_cache stays on std::HashMap because its keys are attacker-influenced (raw DNS query bytes).
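
A minimal sketch of the swap, assuming key/value shapes along the lines of the commit message (field and type names here are illustrative):

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

use rustc_hash::FxHashMap;

// Illustrative key/value shapes only — the real crate's types differ.
type FlowKey = (u16, Ipv4Addr, u16); // (guest_src_port, dst_ip, dst_port)
struct Flow;

struct SlirpTables {
    // Guest-chosen keys behind the guest-kernel trust boundary:
    // a fast non-cryptographic hash is fine here.
    flow_table: FxHashMap<FlowKey, Flow>,
    token_to_key: FxHashMap<u64, FlowKey>,
    // Keys derived from raw DNS query bytes are attacker-influenced,
    // so this map keeps std's SipHash.
    dns_cache: HashMap<Vec<u8>, Vec<u8>>,
}

impl SlirpTables {
    fn new() -> Self {
        Self {
            // FxHashMap::default() is the idiomatic constructor
            // (HashMap::new() only exists for the default RandomState hasher).
            flow_table: FxHashMap::default(),
            token_to_key: FxHashMap::default(),
            dns_cache: HashMap::new(),
        }
    }
}
```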

1c — Mark cold paths

Added #[cold] on handle_tcp_syn_outbound and cold_branch() calls (a #[cold] #[inline(never)] no-op stub) at:

  • LastAck → Closed
  • FIN-from-guest match block
  • RST-from-guest
  • ACK-driven read failure
  • Write-to-host failure

std::hint::cold_path() is still unstable on rustc 1.93, so the stub is the portable equivalent.
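
A sketch of the stub plus one tagged arm; only cold_branch() itself mirrors the description above, the surrounding state handling is a stand-in:

```rust
/// Portable stand-in for the still-unstable std::hint::cold_path():
/// a no-op whose #[cold] #[inline(never)] attributes tell the optimiser
/// the calling branch is rarely taken, so that arm is laid out away
/// from the hot code.
#[cold]
#[inline(never)]
fn cold_branch() {}

// Illustrative call site for one of the tagged arms (the state handling
// around it is a stand-in, not the crate's real code).
fn on_ack_in_last_ack(fin_acked: bool) {
    if fin_acked {
        cold_branch();
        // LastAck → Closed teardown: fires once per connection lifetime.
    }
    // hot path continues…
}
```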

Bench evidence — divan vs current main

scripts/bench-compare.sh --baseline origin/main --skip-vm:

| Bench | Δ% | Note |
| --- | --- | --- |
| flow_table_insert_remove/1000 | -22.1% | FxHash win on the targeted bench |
| flow_table_insert_remove/100 | -6.3% | same, smaller N |
| nat_translate_outbound_hot_path | -14.0% | |
| tcp_rx_latency_one_packet | -7.3% | |
| poll_with_n_flows/1000 | -3.4% | |
| process_syn_during_pending_connects/100 | -14.7% | |

Several other benches moved by ±5–20% run-to-run across multiple measurements — bench-machine variance, not signal.

PMU comparison — Phase 6.3 baseline (post-cache-fix) vs followups

Same 30s capture, same workload, single bench process, perf-agent eBPF:

| Metric | Phase 6.3 alone | Phase 6.3 + followups | Δ |
| --- | --- | --- | --- |
| IPC | 0.786 | 0.788 | +0.3% |
| Cache misses / 1K instr | 3.924 | 3.867 | -1.5% |
| Total Instructions (30s) | 21.50 B | 21.29 B | -1.0% |
| Total Cycles | 27.35 B | 27.01 B | -1.2% |
| Total Cache Misses (abs) | 84.36 M | 82.32 M | -2.4% |
| P99.9 on-CPU | 9.41 ms | 9.73 ms | within noise |

Modest, consistent direction across all PMU dimensions: more work per cycle, slightly fewer cache misses, less total work. The HashMap → FxHash swap accounts for most of the instruction-count drop.

Profile attribution restored

Pre-1a: handle_tcp_frame at 25% flat — the function was a 595-line monolith and the profiler couldn't attribute internal cost.

Post-followups: handle_tcp_frame at 27.59% flat (the +2.6 pp delta comes from #[inline(never)] keeping the function as its own frame), while the SYN setup path now shows up as handle_tcp_syn_outbound in flow-creation benches, separately attributed.

Test plan

  • cargo fmt --all -- --check
  • cargo clippy --workspace --all-targets --all-features -- -D warnings
  • RUSTDOCFLAGS="-D warnings" cargo doc --no-deps --all-features
  • cargo test --test network_baseline -- --test-threads=1 — 24/24
  • cargo test --test network_baseline --features bench-helpers -- --test-threads=1 — 26/26 (3/3 stable runs)
  • scripts/bench-compare.sh --baseline origin/main --skip-vm — table above
  • CI

Follow-ups (not in this PR)

  • passt head-to-head benchmark harness — separate PR (deferred per docs/superpowers/plans/2026-04-27-smoltcp-passt-port.md § "passt head-to-head methodology").
  • Larger structural reshape of handle_tcp_frame if profiling against passt shows we still trail meaningfully on CRR.

dpsoft requested a review from Copilot May 6, 2026 15:00
Base automatically changed from phase6.3-window-mgmt-rebased to main May 6, 2026 21:14
dpsoft added 4 commits May 6, 2026 18:17
handle_tcp_frame was 595 lines, with the SYN handler taking ~265 of them.
Profiling shows the function at ~25% flat CPU on tcp_bulk_throughput_1mb
but with no callee attribution — the SYN setup path is folded into the
hot data path's profile, masking which arm is the actual bottleneck.

Extract the outbound-SYN handler into its own method. Pure refactor,
no behavior change, no API surface change. The dispatcher remains a
single if-tail-call so the data-path body is back to ~330 lines and
the profiler can attribute SYN setup separately when running
flow-creation benches.

The std HashMap uses SipHash for DoS resistance against
attacker-influenced keys.  The SLIRP flow table is keyed by
(guest_src_port, dst_ip, dst_port) — values the guest itself chooses
on its own networking — and lives behind the same trust boundary as
the guest kernel.  SipHash buys nothing here.

rustc-hash's FxHash is a fast non-cryptographic hash designed for
short integer-tuple keys.  Same API as std HashMap (Entry-API,
Default), drop-in replacement.

Expected win: a few ns per probe, multiplied by every flow_table
lookup on the data path (one per incoming TCP/UDP/ICMP frame).

Production change scoped to the two SLIRP maps: dns_cache stays on
std HashMap because its keys are attacker-influenced (raw DNS query
bytes) and DoS resistance matters there.

Profiling shows handle_tcp_frame at ~25% flat CPU on
tcp_bulk_throughput_1mb, with the hot path being Established + payload
+ ACK, while the FIN/RST/error-close arms fire on a tiny fraction of
frames.  Hint the compiler so it keeps the cold arms out of the hot
inline window.

- handle_tcp_syn_outbound: marked #[cold]; SYN setup is one syscall
  per connection vs ~10k frames/s on bulk.
- LastAck → Closed, FIN-from-guest, RST-from-guest, ACK-driven read
  failure, write-to-host failure: tagged via cold_branch() so the
  optimiser spills these arms.

cold_branch() is a #[cold] #[inline(never)] no-op stub.
std::hint::cold_path() is still unstable on rustc 1.93, so the stub
is the portable equivalent: the call site is treated as a rare
branch and the surrounding code is laid out with the cold arm
spilled away.

…ution

After the SYN extraction shrunk handle_tcp_frame by ~265 lines, the
optimiser started inlining it into handle_ipv4_frame, hiding the
data-path cost behind the upstream caller's frame.  Mark it
#[inline(never)] so it stays a separate frame in profiles.  Cost is
~2-3 ns of call overhead per incoming TCP frame, which at ~10k
frames/s on bulk throughput is ~0.003% of CPU — well below noise.
dpsoft force-pushed the slirp-perf-handle-tcp-frame-followups branch from 269c7d4 to 83fdf01 on May 6, 2026 21:18
dpsoft merged commit db05134 into main May 6, 2026
22 checks passed
dpsoft deleted the slirp-perf-handle-tcp-frame-followups branch May 6, 2026 21:27